A visual exploration of gradient descent optimizers, including Gradient Descent, Stochastic Gradient Descent (SGD), Momentum, AdaGrad, RMSProp, and Adam, comparing their convergence behavior, efficiency, and suitability across various optimization landscapes and real-world datasets.
Author
Affiliation
Stat Squad
School of Information, University of Arizona
Abstract
This project aims to provide an in-depth visual analysis of various gradient descent-based optimization algorithms, including Gradient Descent, Stochastic Gradient Descent (SGD), Momentum, AdaGrad, RMSProp, and Adam. By simulating these optimizers on different synthetic loss landscapes and applying them to a real-world dataset, the project will explore their distinct convergence behaviors, efficiency, and adaptability. Through dynamic visualizations and interactive tools, the project will help users understand the strengths and limitations of each optimizer, enabling better selection of optimization techniques for different machine learning tasks.
Introduction
Optimization is a fundamental component of machine learning, as it drives the process of training models by minimizing a loss function. One of the most widely used optimization techniques is gradient descent, which iteratively adjusts model parameters to reduce the error in predictions. However, the basic gradient descent algorithm has limitations, such as slow convergence and sensitivity to the choice of learning rate. To address these challenges, several optimizers have been developed to improve the efficiency, stability, and adaptability of gradient descent.
This project focuses on six popular gradient descent-based optimizers: Gradient Descent, Stochastic Gradient Descent (SGD), Momentum, AdaGrad, RMSProp, and Adam. These optimizers differ in how they adjust the learning rate, incorporate past gradients, and handle noisy data. While each optimizer has its advantages and drawbacks, understanding their behavior and performance on different types of loss functions is crucial for selecting the most appropriate method for a given machine learning task.
The goal of this project is to visually demonstrate the convergence behaviors and efficiency of these optimizers in various optimization landscapes, such as convex, non-convex, and saddle-point surfaces. Using both synthetic data and real-world examples, the project will explore how each optimizer performs across different types of loss surfaces. By generating dynamic, animated visualizations, we aim to provide a deeper understanding of how each optimizer functions, offering a valuable tool for students and practitioners to choose the right optimization technique for their machine learning models.
Approach:
1. Synthetic Data
Objective: Explore optimizer performance on different synthetic loss landscapes (convex, non-convex, saddle point).
Gradient Descent: Apply to the quadratic function (convex).
Stochastic Gradient Descent (SGD): Apply to the quadratic function; check for noisy convergence.
Momentum: Apply to the quadratic and Rosenbrock functions to test for smoother convergence.
AdaGrad: Apply to the non-convex and saddle-point functions to see how the learning rate adapts.
RMSProp: Test on the non-convex and saddle-point functions for faster convergence.
Adam: Apply to all landscapes (quadratic, Rosenbrock, saddle point) to compare convergence stability and speed.
Exploring Optimizers on Synthetic Data
Synthetic data helps us simulate and study optimizer behavior in controlled environments where we know the true function and its minimum. This allows for a more straightforward analysis of how optimizers perform in different loss landscapes. For the synthetic data, we will generate loss surfaces for three different types: convex, non-convex, and saddle-point.
Generating Synthetic Data in R:
Quadratic Function (Convex Surface):
Formula: \[f(x, y) = x^2 + y^2\]
Minimum: Located at (0,0).
Description: A smooth, convex paraboloid that is symmetric around the origin, making it a straightforward example to observe the convergence path.
Rosenbrock Function (Non-Convex Surface):
Formula: \[f(x, y) = (a - x)^2 + b \cdot (y - x^2)^2\]
Parameters: Typically, a=1 and b=100.
Minimum: Located at (a, a^2), typically (1,1) with the default parameters.
Description: This non-convex function has a narrow valley that requires careful adjustment to navigate. It’s often used to highlight how optimizers handle complex landscapes.
Saddle Point Function (Mixed Curvature Surface):
Formula: \[f(x, y) = x^2 - y^2\]
Saddle Point: Located at (0,0).
Description: This function has a saddle point at the origin. It’s concave along one axis and convex along the other, providing a useful case for understanding optimizers’ sensitivity to directional changes in curvature.
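The three surfaces above can be generated in R with a small helper. This is a sketch distilled from the report's source; the helper name `make_surface` is illustrative, and the contour plotting (via ggplot2's `geom_contour_filled`, as used in the report) is left as a comment so the snippet stays self-contained.

```r
# Evaluate a loss surface f(x, y) on a regular grid, mirroring the grids
# used for the contour plots in this section.
make_surface <- function(f, xlim = c(-5, 5), ylim = c(-5, 5), n = 100) {
  grid <- expand.grid(x = seq(xlim[1], xlim[2], length.out = n),
                      y = seq(ylim[1], ylim[2], length.out = n))
  grid$z <- f(grid$x, grid$y)
  grid
}

quadratic  <- function(x, y) x^2 + y^2                                    # convex
rosenbrock <- function(x, y, a = 1, b = 100) (a - x)^2 + b * (y - x^2)^2  # non-convex
saddle     <- function(x, y) x^2 - y^2                                    # mixed curvature

surf <- make_surface(quadratic)
# ggplot(surf, aes(x, y, z = z)) + geom_contour_filled()  # as in the report
```

The same helper works for all three surfaces; only the function and, for the Rosenbrock case, the grid limits change.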
Gradient Descent
Mathematical Intuition: Gradient Descent updates parameters based on the negative gradient of the loss function. For a loss function f, the update rule is:
\[\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)\]
where η is the learning rate and ∇f(θ_t) is the gradient at point θ_t.
R Implementation:
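A minimal, plot-free sketch of the update loop on f(x, y) = x² + y², using the settings described in the explanation below (start at (4, 4), learning rate 0.1, 50 iterations); the full animated version with ggplot2/gganimate arrows is in the source code.

```r
# Plain gradient descent on f(x, y) = x^2 + y^2 (gradient: 2x, 2y).
# Returns the trajectory of (x, y) at every iteration.
gradient_descent <- function(x_init, y_init, learning_rate = 0.1, n_iter = 50) {
  x <- x_init; y <- y_init
  path <- data.frame(step = 0, x = x, y = y)
  for (i in seq_len(n_iter)) {
    x <- x - learning_rate * (2 * x)  # step against the gradient in x
    y <- y - learning_rate * (2 * y)  # step against the gradient in y
    path <- rbind(path, data.frame(step = i, x = x, y = y))
  }
  path
}

path <- gradient_descent(4, 4)
tail(path, 1)  # both coordinates shrink toward the minimum at (0, 0)
```

Because the gradient is proportional to the distance from the origin, each step is a fixed fraction of the remaining distance, which is why the arrows in the animation shorten as the path approaches the minimum.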
Explanation: In this visualization, we start at the initial point (4, 4) on the surface of the quadratic function f(x, y) = x^2 + y^2. The gradient descent algorithm iteratively updates x and y by moving in the direction of steepest descent, determined by the negative gradient of the function.
At each step, arrows are drawn from the current position to the updated position, showing the direction and magnitude of the movement. The length of each arrow represents the step size, which decreases as we approach the minimum because the gradient is proportional to the distance from the origin.
The trajectory clearly illustrates how gradient descent converges toward the global minimum at (0, 0), where the function attains its lowest value. This highlights how effectively gradient descent works in a smooth, convex landscape like this quadratic function.
Stochastic Gradient Descent (SGD)
Mathematical Intuition: In SGD, instead of using the full gradient, we compute the gradient using a single random data point. This makes the process noisy but can be more efficient in large datasets.
R Implementation:
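As in the report's source, the stochastic setting is simulated by adding uniform noise to the true gradient of x² + y²; the sketch below reproduces just that update loop (plotting omitted), with a seed added so the noisy trajectory is reproducible.

```r
# SGD simulated by perturbing the exact gradient with uniform noise.
set.seed(42)  # reproducible noise
sgd <- function(x_init, y_init, learning_rate = 0.1, n_iter = 50) {
  x <- x_init; y <- y_init
  path <- data.frame(step = 0, x = x, y = y)
  for (i in seq_len(n_iter)) {
    grad_x <- 2 * x + runif(1, -0.1, 0.1)  # noisy gradient estimate in x
    grad_y <- 2 * y + runif(1, -0.1, 0.1)  # noisy gradient estimate in y
    x <- x - learning_rate * grad_x
    y <- y - learning_rate * grad_y
    path <- rbind(path, data.frame(step = i, x = x, y = y))
  }
  path
}

path_sgd <- sgd(4, 4)
tail(path_sgd, 1)  # hovers near (0, 0) instead of settling exactly on it
```

The noise never fully dies out, so the iterate jitters in a small neighborhood of the minimum rather than converging to it exactly; this is the oscillation visible in the animation.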
Explanation: In this visualization, we begin at the initial point (4, 4) and update x and y at each step using the stochastic gradient descent algorithm. Instead of calculating the full gradient, noise is added to the gradient at each iteration to simulate a stochastic environment.
Arrows are used to represent the direction and magnitude of the updates, showing the noisy movement toward the global minimum at (0, 0). Unlike standard gradient descent, stochastic gradient descent introduces small oscillations due to the random noise, which can help in escaping local minima in more complex landscapes.
The visualization highlights the efficiency of stochastic gradient descent in large datasets where calculating the full gradient at each step may be computationally expensive. The oscillations, however, underscore the trade-off between stability and speed in optimization.
Momentum Optimizer
Mathematical Intuition: Momentum is an optimization technique that helps accelerate gradient descent by adding a fraction of the previous update to the current update. This reduces oscillations and helps achieve faster convergence.
Update Rule:
\[v_t = \beta v_{t-1} + (1 - \beta) \nabla f(\theta_t)\]
\[\theta_{t+1} = \theta_t - \eta v_t\]
where:
v_t is the velocity (momentum),
β is the momentum coefficient (usually close to 1),
η is the learning rate,
∇f(θ_t) is the gradient at the current step.
Momentum helps speed up convergence by adding a fraction of the previous velocity to the current gradient.
R Implementation:
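A plot-free sketch of the momentum update on f(x, y) = x² + y², using the same settings as the source (start at (4, 4), learning rate 0.1, β = 0.9, 50 iterations):

```r
# Momentum on f(x, y) = x^2 + y^2: the velocity is an exponential moving
# average of past gradients, and the parameters move along the velocity.
momentum <- function(x_init, y_init, learning_rate = 0.1, beta = 0.9, n_iter = 50) {
  x <- x_init; y <- y_init
  v_x <- 0; v_y <- 0  # velocities start at rest
  path <- data.frame(step = 0, x = x, y = y)
  for (i in seq_len(n_iter)) {
    v_x <- beta * v_x + (1 - beta) * (2 * x)  # accumulate velocity in x
    v_y <- beta * v_y + (1 - beta) * (2 * y)  # accumulate velocity in y
    x <- x - learning_rate * v_x
    y <- y - learning_rate * v_y
    path <- rbind(path, data.frame(step = i, x = x, y = y))
  }
  path
}

path_mom <- momentum(4, 4)
```

Note that the accumulated velocity can carry the iterate slightly past the minimum before it settles; that small overshoot is the price of the acceleration momentum provides.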
Explanation: This visualization starts at the initial point (4, 4) and simulates the updates using the momentum optimization technique. At each step:
The gradient is calculated, and a fraction of the previous velocity is added to the current gradient update.
This velocity term reduces oscillations and accelerates convergence toward the global minimum.
Arrows indicate the direction and magnitude of updates. The momentum term helps the optimizer move smoothly and quickly, demonstrating how it handles optimization in a convex landscape.
AdaGrad Optimizer
Mathematical Intuition: AdaGrad adapts the learning rate for each parameter based on the historical gradients. Parameters that have frequently large gradients get smaller updates, and parameters with infrequent large gradients get larger updates.
Update Rule:
\[g_t = \nabla f(\theta_t)\]
\[G_t = G_{t-1} + g_t^2\]
\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t\]
where:
G_t is the sum of squared gradients,
ϵ is a small constant to prevent division by zero.
AdaGrad ensures that frequently updated parameters have smaller learning rates.
R Implementation:
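A sketch of the AdaGrad loop on f(x, y) = x² + y² under the same settings as the other sections (start (4, 4), base learning rate 0.1, 50 iterations). The denominator here follows the update rule above, sqrt(G + ε); plotting is omitted.

```r
# AdaGrad on f(x, y) = x^2 + y^2: each coordinate's effective learning
# rate shrinks as its squared gradients accumulate in G.
adagrad <- function(x_init, y_init, learning_rate = 0.1, epsilon = 1e-8, n_iter = 50) {
  x <- x_init; y <- y_init
  G_x <- 0; G_y <- 0  # running sums of squared gradients
  path <- data.frame(step = 0, x = x, y = y)
  for (i in seq_len(n_iter)) {
    g_x <- 2 * x; g_y <- 2 * y
    G_x <- G_x + g_x^2; G_y <- G_y + g_y^2       # accumulate history
    x <- x - learning_rate / sqrt(G_x + epsilon) * g_x
    y <- y - learning_rate / sqrt(G_y + epsilon) * g_y
    path <- rbind(path, data.frame(step = i, x = x, y = y))
  }
  path
}

path_ada <- adagrad(4, 4)
```

Because G only ever grows, the effective step shrinks every iteration; on this surface progress therefore slows steadily, which is the learning-rate decay noted in the comparison table.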
Explanation: We start at the initial point (4, 4) on the quadratic surface f(x, y) = x^2 + y^2. The AdaGrad optimizer adapts the learning rate for each parameter based on the accumulated squared gradients: parameters with frequently large gradients get smaller updates, while parameters with infrequent large gradients get larger updates. Arrows represent the direction and magnitude of updates at each step. The trajectory converges toward the global minimum at (0, 0), demonstrating how AdaGrad adjusts its learning rate dynamically to achieve convergence in a smooth, convex optimization landscape.
RMSProp Optimizer
Mathematical Intuition: RMSProp is an adaptive learning rate method that divides the learning rate by a moving average of the squared gradients. It works similarly to AdaGrad but with a smoothing term to prevent the learning rate from decaying too quickly.
Update Rule:
Compute the gradient g_t:
\[
g_t = \nabla f(\theta_t)
\]
Update the second moment v_t with exponential decay:
\[
v_t = \beta v_{t-1} + (1 - \beta) g_t^2
\]
Update the parameter using the RMSProp rule:
\[
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \cdot g_t
\]
where:
β is the smoothing constant, typically set to 0.9,
ϵ is a small constant for numerical stability.
RMSProp adapts the learning rate based on recent gradient information, helping to prevent the learning rate from shrinking too quickly.
R Implementation:
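A sketch of the RMSProp loop on f(x, y) = x² + y², directly following the update rule above. The hyperparameters (start (4, 4), learning rate 0.1, β = 0.9, 50 iterations) are assumed to match the other sections; plotting is omitted.

```r
# RMSProp on f(x, y) = x^2 + y^2: v is an exponential moving average of
# squared gradients, so old gradients are gradually forgotten (unlike AdaGrad).
rmsprop <- function(x_init, y_init, learning_rate = 0.1, beta = 0.9,
                    epsilon = 1e-8, n_iter = 50) {
  x <- x_init; y <- y_init
  v_x <- 0; v_y <- 0  # second-moment estimates
  path <- data.frame(step = 0, x = x, y = y)
  for (i in seq_len(n_iter)) {
    g_x <- 2 * x; g_y <- 2 * y
    v_x <- beta * v_x + (1 - beta) * g_x^2  # decayed squared-gradient average
    v_y <- beta * v_y + (1 - beta) * g_y^2
    x <- x - learning_rate / sqrt(v_x + epsilon) * g_x
    y <- y - learning_rate / sqrt(v_y + epsilon) * g_y
    path <- rbind(path, data.frame(step = i, x = x, y = y))
  }
  path
}

path_rms <- rmsprop(4, 4)
```

The exponential decay is the key difference from AdaGrad: because v tracks only recent gradients, the effective step does not shrink monotonically, and the optimizer keeps making progress even late in training.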
Explanation:
We start at the initial point (4,4) on the quadratic surface f(x, y) = x^2 + y^2.
The RMSProp optimizer updates the learning rate dynamically using a moving average of the squared gradients. This ensures:
Gradients that occur frequently do not overly diminish the learning rate.
Gradients that occur rarely still allow meaningful updates.
The momentum-like smoothing term (β) helps prevent the learning rate from decaying too quickly, unlike AdaGrad.
Arrows represent the direction and magnitude of updates, showcasing the efficiency of RMSProp in converging toward the global minimum at (0,0). This visualization highlights the balance RMSProp strikes between learning rate adaptability and update stability.
Adam Optimizer
Mathematical Intuition: Adam (Adaptive Moment Estimation) combines the ideas of both Momentum and RMSProp. It tracks both the first moment (mean) and second moment (variance) of the gradients to adapt the learning rate for each parameter. It also uses bias correction to account for the initialization of the first and second moment estimates.
Update Rule:
First Moment Estimate (Momentum):
\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
\]
where m_t is the first moment estimate (momentum) and g_t is the gradient at time t.
Second Moment Estimate (RMSProp-like):
\[
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\]
where v_t is the second moment estimate (similar to RMSProp).
Bias-Corrected First Moment Estimate:
\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\]
This corrects the bias introduced by the initialization of m_t.
Bias-Corrected Second Moment Estimate:
\[\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]
This corrects the bias introduced by the initialization of v_t.
Parameter Update:
\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t\]
Adam is highly effective and widely used because it adapts the learning rate (based on the gradient's first and second moments) and improves convergence across a wide variety of tasks.
R Implementation:
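A sketch of the full Adam loop on f(x, y) = x² + y², combining the four steps above. The start point (4, 4), learning rate 0.1, and 50 iterations mirror the other sections; β₁ = 0.9, β₂ = 0.999, and ε = 1e-8 are the standard defaults and assumed here. Plotting is omitted.

```r
# Adam on f(x, y) = x^2 + y^2: first moment m (momentum), second moment v
# (RMSProp-style), plus bias correction for their zero initialization.
adam <- function(x_init, y_init, learning_rate = 0.1, beta1 = 0.9,
                 beta2 = 0.999, epsilon = 1e-8, n_iter = 50) {
  theta <- c(x_init, y_init)
  m <- c(0, 0); v <- c(0, 0)
  path <- data.frame(step = 0, x = theta[1], y = theta[2])
  for (t in seq_len(n_iter)) {
    g <- 2 * theta                          # gradient of x^2 + y^2
    m <- beta1 * m + (1 - beta1) * g        # first moment estimate
    v <- beta2 * v + (1 - beta2) * g^2      # second moment estimate
    m_hat <- m / (1 - beta1^t)              # bias-corrected moments
    v_hat <- v / (1 - beta2^t)
    theta <- theta - learning_rate * m_hat / (sqrt(v_hat) + epsilon)
    path <- rbind(path, data.frame(step = t, x = theta[1], y = theta[2]))
  }
  path
}

path_adam <- adam(4, 4)
```

Without the bias correction, m and v would be biased toward zero in early iterations (they start at zero), making the first steps artificially small; dividing by 1 − βᵗ undoes exactly that.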
Explanation:
The optimizer starts at the initial point (4,4) on the quadratic surface f(x, y) = x^2 + y^2.
Adam combines the benefits of:
Momentum: Using the first moment estimate (m_t) to smooth updates and reduce oscillations.
RMSProp: Using the second moment estimate (v_t) to adapt the learning rate dynamically for each parameter.
Bias correction ensures that the first few updates are unbiased, despite starting with zero moments.
Arrows represent the direction and magnitude of updates at each step, showing the optimizer’s ability to adapt learning rates and converge efficiently to the global minimum at (0,0).
Visualizing the Behavior of the Different Optimizers
Efficiency Comparison
Speed:
Fastest: Adam, RMSProp, and SGD (for large datasets).
Slowest: Gradient Descent due to computation over the entire dataset at each step.
Convergence Stability:
Stable: Momentum, Adam, and RMSProp are robust to oscillations.
Noisy: SGD can oscillate or overshoot due to randomness.
Adaptability:
Most Adaptive: Adam (adjusts learning rate dynamically for each parameter).
Least Adaptive: Gradient Descent (fixed learning rate for all parameters).
Handling Sparse Data:
Best: AdaGrad and Adam are designed for sparse gradients.
Worst: Gradient Descent and Momentum may struggle without careful tuning.
Conclusion
Adam is widely regarded as the most efficient and versatile optimizer for general use cases due to its adaptive nature, fast convergence, and stability. However, for very specific tasks, other methods like RMSProp or AdaGrad may perform better.
2D static images
Comparison Between Optimizers
| Optimizer | Description | Pros | Cons | Best Use |
|---|---|---|---|---|
| Gradient Descent | Full gradient each update | Stable convergence | Slow, costly | Small, static datasets |
| SGD | One sample/batch per update | Fast, escapes minima | High variance, less stable | Large datasets, limited resources |
| Momentum | Adds momentum to smooth updates | Faster, reduces oscillation | Adds tuning complexity | Oscillatory paths |
| AdaGrad | Adapts learning rate per parameter | Good for sparse data | Learning rate decays | Sparse datasets |
| RMSProp | Moving avg. squared gradients | Solves AdaGrad decay issue | Needs decay rate tuning | Non-stationary tasks (e.g., RNNs) |
| Adam | Combines Momentum + RMSProp | Fast, robust to noise | Complex, more tuning needed | General-purpose neural networks |
Source Code
---title: "Deep Learning Optimizer Dynamics"subtitle: "INFO 526 - Fall 2024 - Project #2"author: - name: "Stat Squad" affiliations: - name: "School of Information, University of Arizona"description: "A visual exploration of gradient descent optimizers, including Gradient Descent, Stochastic Gradient Descent (SGD), Momentum, AdaGrad, RMSProp, and Adam, comparing their convergence behavior, efficiency, and suitability across various optimization landscapes and real-world datasets."format: html: code-tools: true code-overflow: wrap embed-resources: trueeditor: visualexecute: warning: false echo: false---## AbstractThis project aims to provide an in-depth visual analysis of various gradient descent-based optimization algorithms, including Gradient Descent, Stochastic Gradient Descent (SGD), Momentum, AdaGrad, RMSProp, and Adam. By simulating these optimizers on different synthetic loss landscapes and applying them to a real-world dataset, the project will explore their distinct convergence behaviors, efficiency, and adaptability. Through dynamic visualizations and interactive tools, the project will help users understand the strengths and limitations of each optimizer, enabling better selection of optimization techniques for different machine learning tasks.## IntroductionOptimization is a fundamental component of machine learning, as it drives the process of training models by minimizing a loss function. One of the most widely used optimization techniques is gradient descent, which iteratively adjusts model parameters to reduce the error in predictions. However, the basic gradient descent algorithm has limitations, such as slow convergence and sensitivity to the choice of learning rate. 
To address these challenges, several optimizers have been developed to improve the efficiency, stability, and adaptability of gradient descent.This project focuses on six popular gradient descent-based optimizers: **Gradient Descent**, **Stochastic Gradient Descent (SGD)**, **Momentum**, **AdaGrad**, **RMSProp**, and **Adam**. These optimizers differ in how they adjust the learning rate, incorporate past gradients, and handle noisy data. While each optimizer has its advantages and drawbacks, understanding their behavior and performance on different types of loss functions is crucial for selecting the most appropriate method for a given machine learning task.The goal of this project is to visually demonstrate the convergence behaviors and efficiency of these optimizers in various optimization landscapes, such as convex, non-convex, and saddle-point surfaces. Using both synthetic data and real-world examples, the project will explore how each optimizer performs across different types of loss surfaces. By generating dynamic, animated visualizations, we aim to provide a deeper understanding of how each optimizer functions, offering a valuable tool for students and practitioners to choose the right optimization technique for their machine learning models.```{r}#| label: load-packages#| include: false#| cache: true# Load packages herepacman::p_load(tidyverse, tidytuesdayR, broom, kableExtra, ggplot2, gganimate, dplyr, tidyr, viridis, ggrepel, gifski)# install.packages('viridis')# install.packages('gganimate')library(viridis)library(gganimate)```## **Approach:**#### **1. Synthetic Data**- **Objective:** Explore optimizer performance on different synthetic loss landscapes (convex, non-convex, saddle point). - **Gradient Descent**: Apply on quadratic function (convex). - **Stochastic Gradient Descent (SGD)**: Apply on quadratic function, check for noisy convergence. - **Momentum**: Apply on quadratic function and Rosenbrock function to test smoother convergence. 
- **AdaGrad**: Apply on non-convex and saddle point functions to see how learning rate adapts. - **RMSProp**: Test on non-convex and saddle point functions for faster convergence. - **Adam**: Apply on all landscapes (quadratic, Rosenbrock, saddle point) to compare convergence stability and speed.## Exploring Optimizers on Synthetic DataSynthetic data helps us simulate and study optimizer behavior in controlled environments where we know the true function and its minimum. This allows for a more straightforward analysis of how optimizers perform in different loss landscapes. For the synthetic data, we will generate loss surfaces for three different types: convex, non-convex, and saddle-point.### **Generating Synthetic Data in R:**- ***Quadratic Function (Convex Surface)***: - **Formula**: $$f(x, y) = x^2 + y^2$$ - **Minimum**: Located at (0,0). - **Description**: A smooth, convex paraboloid that is symmetric around the origin, making it a straightforward example to observe the convergence path.```{r eval=TRUE, warning=FALSE} x <- seq(-5, 5, length.out = 100) y <- seq(-5, 5, length.out = 100) grid <- expand.grid(x = x, y = y) grid$z <- grid$x^2 + grid$y^2 library(ggplot2) ggplot(grid, aes(x = x, y = y, z = z)) + geom_contour_filled() # + # labs(title = "Quadratic Function: f(x, y) = x^2 + y^2") ```- ***Rosenbrock Function (Non-Convex Surface)***: - **Formula**: $$f(x, y) = (a - x)^2 + b \cdot (y - x^2)^2$$ - **Parameters**: Typically, a=1 and b=100. - **Minimum**: Located at (a, a\^2), typically (1,1) with the default parameters. - **Description**: This non-convex function has a narrow valley that requires careful adjustment to navigate. 
It's often used to highlight how optimizers handle complex landscapes.```{r eval=TRUE, warning=FALSE} a <- 1 b <- 100 x <- seq(-2, 2, length.out = 100) y <- seq(-1, 3, length.out = 100) grid <- expand.grid(x = x, y = y) grid$z <- (a - grid$x)^2 + b * (grid$y - grid$x^2)^2 ggplot(grid, aes(x = x, y = y, z = z)) + geom_contour_filled() # + # labs(title = "Rosenbrock Function: f(x, y) = (a - x)^2 + b * (y - x^2)^2") ```- ***Saddle Point Function (Mixed Curvature Surface)***: - **Formula**: $$f(x, y) = x^2 - y^2$$ - **Saddle Point**: Located at (0,0). - **Description**: This function has a saddle point at the origin. It’s concave along one axis and convex along the other, providing a useful case for understanding optimizers' sensitivity to directional changes in curvature.```{r eval=TRUE, warning=FALSE} x <- seq(-5, 5, length.out = 100) y <- seq(-5, 5, length.out = 100) grid <- expand.grid(x = x, y = y) grid$z <- grid$x^2 - grid$y^2 ggplot(grid, aes(x = x, y = y, z = z)) + geom_contour_filled() # + # labs(title = "Saddle Point Function: f(x, y) = x^2 - y^2") ```### **Gradient Descent**- **Mathematical Intuition:** Gradient Descent updates parameters based on the negative gradient of the loss function. 
For a function f(x)f(x), the update rule is: $$\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)$$ where ηη is the learning rate and ∇f(θt)∇f(θt) is the gradient at point θtθt.- **R Implementation:**```{r} # # install.packages('viridis') # library(gganimate) # library(viridis) # library(ggplot2) # # # Gradient Descent Function # gradient_descent <- function(x_init, y_init, learning_rate, n_iter) { # x <- x_init # y <- y_init # path <- data.frame(step = 0, x = x, y = y, xend = NA, yend = NA) # # for (i in 1:n_iter) { # grad_x <- 2 * x # Gradient with respect to x # grad_y <- 2 * y # Gradient with respect to y # dx <- -learning_rate * grad_x # dy <- -learning_rate * grad_y # x_new <- x + dx # y_new <- y + dy # path <- rbind(path, data.frame(step = i, x = x, y = y, xend = x_new, yend = y_new)) # x <- x_new # y <- y_new # } # # return(path) # } # # # Initial parameters # x_init <- 4 # y_init <- 4 # learning_rate <- 0.1 # n_iter <- 50 # # # Generate path # path <- gradient_descent(x_init, y_init, learning_rate, n_iter) # # # Remove rows with missing xend or yend values # path <- na.omit(path) # # # Create the quadratic function surface # x_vals <- seq(-5, 5, length.out = 100) # y_vals <- seq(-5, 5, length.out = 100) # grid <- expand.grid(x = x_vals, y = y_vals) # grid$z <- grid$x^2 + grid$y^2 # # # Plotting the gradient descent trajectory with arrows # plot <- ggplot() + # geom_raster(data = grid, aes(x = x, y = y, fill = z), alpha = 0.8) + # scale_fill_viridis(name = "f(x, y)", option = "viridis") + # geom_segment( # data = path, # aes(x = x, y = y, xend = xend, yend = yend, group = 1), # arrow = arrow(type = "closed", length = unit(0.2, "inches")), # color = "white", # linewidth = 0.8 # ) + # labs( # title = "Gradient Descent Trajectory on f(x, y) = x^2 + y^2", # subtitle = "Step: {closest_state}", # x = "x", # y = "y" # ) + # coord_cartesian(xlim = c(-2, 5), ylim = c(-2, 5)) + # Set axis limits # theme_minimal() + # theme( # plot.title = element_text(size = 22, face 
= "bold"), # plot.subtitle = element_text(size = 24, color = 'darkblue'), # axis.title = element_text(size = 18), # legend.title = element_text(size = 18), # legend.text = element_text(size = 14), # axis.text = element_text(size = 14) # ) + # transition_states(path$step, transition_length = 2, state_length = 1) + # shadow_mark(size = 0.5, color = "gray") # # # Save as a GIF # anim_save("presentation_files/gradient_descent_arrows.gif", animation = animate(plot, fps = 5, width = 800, height = 600)) # # cat("GIF saved as 'gradient_descent_arrows.gif'\n") ``````{r show-gif-1, echo=FALSE, out.width="80%", fig.align="center"} knitr::include_graphics("presentation_files/gradient_descent_arrows.gif") ```- **Explanation:** In this visualization, we start at the initial point ( 4 , 4 ) on the surface of the quadratic function 𝑓 ( 𝑥 , 𝑦 ) = 𝑥 2 + 𝑦 2. The gradient descent algorithm iteratively updates the values of 𝑥 and 𝑦 by moving in the direction of the steepest descent, determined by the negative gradient of the function.- At each step, arrows are drawn from the current position to the updated position, showing the direction and magnitude of the movement. The length of each arrow represents the step size, which decreases as we approach the minimum due to the nature of the gradient being proportional to the distance from the origin.- The trajectory clearly illustrates how gradient descent converges toward the global minimum at ( 0 , 0 ), where the function achieves its lowest value. This process highlights how gradient descent works effectively in a smooth, convex landscape like this quadratic function.### **Stochastic Gradient Descent (SGD)**- **Mathematical Intuition:** In **SGD**, instead of using the full gradient, we compute the gradient using a single random data point. 
This makes the process noisy but can be more efficient in large datasets.- **R Implementation:**```{r } # library(viridis) # library(gganimate) # # # Stochastic Gradient Descent Function # stochastic_gradient_descent <- function(x_init, y_init, learning_rate, n_iter) { # x <- x_init # y <- y_init # path <- data.frame(step = 0, x = x, y = y, xend = NA, yend = NA) # # for (i in 1:n_iter) { # grad_x <- 2 * x + runif(1, -0.1, 0.1) # Add noise to gradient for x # grad_y <- 2 * y + runif(1, -0.1, 0.1) # Add noise to gradient for y # dx <- -learning_rate * grad_x # dy <- -learning_rate * grad_y # x_new <- x + dx # y_new <- y + dy # path <- rbind(path, data.frame(step = i, x = x, y = y, xend = x_new, yend = y_new)) # x <- x_new # y <- y_new # } # # return(path) # } # # # Initial parameters # x_init <- 4 # y_init <- 4 # learning_rate <- 0.1 # n_iter <- 50 # # # Generate path # path_sgd <- stochastic_gradient_descent(x_init, y_init, learning_rate, n_iter) # # # Remove rows with missing xend or yend values # path_sgd <- na.omit(path_sgd) # # # Create the quadratic function surface # x_vals <- seq(-5, 5, length.out = 100) # y_vals <- seq(-5, 5, length.out = 100) # grid <- expand.grid(x = x_vals, y = y_vals) # grid$z <- grid$x^2 + grid$y^2 # # # Plotting the stochastic gradient descent trajectory with arrows # plot_sgd <- ggplot() + # geom_raster(data = grid, aes(x = x, y = y, fill = z), alpha = 0.8) + # scale_fill_viridis(name = "f(x, y)", option = "viridis") + # geom_segment( # data = path_sgd, # aes(x = x, y = y, xend = xend, yend = yend, group = 1), # arrow = arrow(type = "closed", length = unit(0.2, "inches")), # color = "white", # linewidth = 0.8 # ) + # labs( # title = "Stochastic Gradient Descent (SGD) on f(x, y) = x^2 + y^2", # subtitle = "Step: {closest_state}", # x = "x", # y = "y" # ) + # coord_cartesian(xlim = c(-2, 5), ylim = c(-2, 5)) + # Set axis limits # theme_minimal() + # theme( # plot.title = element_text(size = 22, face = "bold"), # plot.subtitle = 
element_text(size = 24, color = 'darkblue'), # axis.title = element_text(size = 18), # legend.title = element_text(size = 18), # legend.text = element_text(size = 14), # axis.text = element_text(size = 14) # ) + # transition_states(path_sgd$step, transition_length = 2, state_length = 1) + # shadow_mark(size = 0.5, color = "gray") # # # Save as a GIF # anim_save("presentation_files/stochastic_gradient_descent_arrows.gif", animation = animate(plot_sgd, fps = 5, width = 800, height = 600)) # # cat("GIF saved as 'stochastic_gradient_descent_arrows.gif'\n") ``````{r show-gif-2, echo=FALSE, out.width="80%", fig.align="center"} knitr::include_graphics("presentation_files/stochastic_gradient_descent_arrows.gif") ```- - **Explanation:** In this visualization, we begin at the initial point ( 4 , 4 ) and update the values of 𝑥 and 𝑦 at each step using the stochastic gradient descent algorithm. Instead of calculating the full gradient, noise is added to the gradient at each iteration to simulate a stochastic environment.- Arrows are used to represent the direction and magnitude of the updates, showing the noisy movement toward the global minimum at ( 0 , 0 ). Unlike standard gradient descent, stochastic gradient descent introduces small oscillations due to the random noise, which can help in escaping local minima in more complex landscapes.- The visualization highlights the efficiency of stochastic gradient descent in large datasets where calculating the full gradient at each step may be computationally expensive. The oscillations, however, underscore the trade-off between stability and speed in optimization.### **Momentum Optimizer**- **Mathematical Intuition:** Momentum is an optimization technique that helps accelerate gradient descent by adding a fraction of the previous update to the current update. 
This reduces oscillations and helps in faster convergence.- **Update Rule:** $$v_t = \beta v_{t-1} + (1 - \beta) \nabla f(\theta_t)$$ $$\theta_{t+1} = \theta_t - \eta v_t$$ where: - vtvt is the velocity (momentum), - ββ is the momentum coefficient (usually close to 1), - ηη is the learning rate, - ∇f(θt)∇f(θt) is the gradient at the current step.- Momentum helps speed up the convergence by adding a fraction of the previous velocity to the current gradient.- **R Implementation:**```{r} # # Momentum Optimization on Quadratic Function # library(viridis) # library(gganimate) # # # Momentum Optimizer Function # momentum_optimizer <- function(x_init, y_init, learning_rate, beta, n_iter) { # x <- x_init # y <- y_init # v_x <- 0 # Initialize velocity for x # v_y <- 0 # Initialize velocity for y # path <- data.frame(step = 0, x = x, y = y, xend = NA, yend = NA) # # for (i in 1:n_iter) { # grad_x <- 2 * x # Gradient with respect to x # grad_y <- 2 * y # Gradient with respect to y # # # Update velocity # v_x <- beta * v_x + (1 - beta) * grad_x # v_y <- beta * v_y + (1 - beta) * grad_y # # # Update parameters # dx <- -learning_rate * v_x # dy <- -learning_rate * v_y # x_new <- x + dx # y_new <- y + dy # # path <- rbind(path, data.frame(step = i, x = x, y = y, xend = x_new, yend = y_new)) # x <- x_new # y <- y_new # } # # return(path) # } # # # Initial parameters # x_init <- 4 # y_init <- 4 # learning_rate <- 0.1 # beta <- 0.9 # n_iter <- 50 # # # Generate path # path_momentum <- momentum_optimizer(x_init, y_init, learning_rate, beta, n_iter) # # # Remove rows with missing xend or yend values # path_momentum <- na.omit(path_momentum) # # # Create the quadratic function surface # x_vals <- seq(-5, 5, length.out = 100) # y_vals <- seq(-5, 5, length.out = 100) # grid <- expand.grid(x = x_vals, y = y_vals) # grid$z <- grid$x^2 + grid$y^2 # # # Plotting the momentum optimizer trajectory with arrows # plot_momentum <- ggplot() + # geom_raster(data = grid, aes(x = x, y = y, fill = z), 
alpha = 0.8) + # scale_fill_viridis(name = "f(x, y)", option = "viridis") + # geom_segment( # data = path_momentum, # aes(x = x, y = y, xend = xend, yend = yend, group = 1), # arrow = arrow(type = "closed", length = unit(0.2, "inches")), # color = "white", # linewidth = 0.8 # ) + # labs( # title = "Momentum Optimizer on f(x, y) = x^2 + y^2", # subtitle = "Step: {closest_state}", # x = "x", # y = "y" # ) + # coord_cartesian(xlim = c(-2, 5), ylim = c(-2, 5)) + # Set axis limits # theme_minimal() + # theme( # plot.title = element_text(size = 22, face = "bold"), # plot.subtitle = element_text(size = 24, color = 'darkblue'), # axis.title = element_text(size = 18), # legend.title = element_text(size = 18), # legend.text = element_text(size = 14), # axis.text = element_text(size = 14) # ) + # transition_states(path_momentum$step, transition_length = 2, state_length = 1) + # shadow_mark(size = 0.5, color = "gray") # # # Save as a GIF # anim_save("presentation_files/momentum_optimizer_arrows.gif", animation = animate(plot_momentum, fps = 5, width = 800, height = 600)) # # cat("GIF saved as 'momentum_optimizer_arrows.gif'\n") ``````{r show-gif-3, echo=FALSE, out.width="80%", fig.align="center"} knitr::include_graphics("presentation_files/momentum_optimizer_arrows.gif") ```- **Explanation**: This visualization starts at the initial point (4,4) and simulates the updates using the momentum optimization technique. At each step - The gradient is calculated, and a fraction of the previous velocity is added to the current gradient update. - This velocity term reduces oscillations and accelerates convergence toward the global minimum. - Arrows indicate the direction and magnitude of updates. The momentum term helps the optimizer move smoothly and quickly, demonstrating how it handles optimization in a convex landscape.### **AdaGrad Optimizer**- **Mathematical Intuition:** AdaGrad adapts the learning rate for each parameter based on the historical gradients. 
Parameters whose accumulated gradients are large receive smaller updates, while rarely updated parameters receive larger ones.

- **Update Rule:** $$ g_t = \nabla f(\theta_t) $$ $$ G_t = G_{t-1} + g_t^2 $$ $$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \cdot g_t $$ where:
  - $G_t$ is the running sum of squared gradients,
  - $\epsilon$ is a small constant to prevent division by zero.
- AdaGrad ensures that frequently updated parameters have smaller learning rates.
- **R Implementation:**

```{r}
# library(viridis)
# library(gganimate)
# 
# # AdaGrad Optimizer Function
# adagrad_optimizer <- function(x_init, y_init, learning_rate, epsilon, n_iter) {
#   x <- x_init
#   y <- y_init
#   G_x <- 0  # Initialize sum of squared gradients for x
#   G_y <- 0  # Initialize sum of squared gradients for y
#   path <- data.frame(step = 0, x = x, y = y, xend = NA, yend = NA)
# 
#   for (i in 1:n_iter) {
#     grad_x <- 2 * x  # Gradient with respect to x
#     grad_y <- 2 * y  # Gradient with respect to y
# 
#     # Update sum of squared gradients
#     G_x <- G_x + grad_x^2
#     G_y <- G_y + grad_y^2
# 
#     # Update parameters
#     dx <- -(learning_rate / (sqrt(G_x) + epsilon)) * grad_x
#     dy <- -(learning_rate / (sqrt(G_y) + epsilon)) * grad_y
#     x_new <- x + dx
#     y_new <- y + dy
# 
#     path <- rbind(path, data.frame(step = i, x = x, y = y, xend = x_new, yend = y_new))
#     x <- x_new
#     y <- y_new
#   }
# 
#   return(path)
# }
# 
# # Initial parameters
# x_init <- 4
# y_init <- 4
# learning_rate <- 0.1
# epsilon <- 1e-8
# n_iter <- 50
# 
# # Generate path
# path_adagrad <- adagrad_optimizer(x_init, y_init, learning_rate, epsilon, n_iter)
# 
# # Remove rows with missing xend or yend values
# path_adagrad <- na.omit(path_adagrad)
# 
# # Create the quadratic function surface
# x_vals <- seq(-5, 5, length.out = 100)
# y_vals <- seq(-5, 5, length.out = 100)
# grid <- expand.grid(x = x_vals, y = y_vals)
# grid$z <- grid$x^2 + grid$y^2
# 
# # Plotting the AdaGrad optimizer trajectory with arrows
# plot_adagrad <- ggplot() +
#   geom_raster(data = grid, aes(x = x, y = y, fill = z), alpha = 0.8) +
#   scale_fill_viridis(name = "f(x, y)", option = "viridis") +
#   geom_segment(
#     data = path_adagrad,
#     aes(x = x, y = y, xend = xend, yend = yend, group = 1),
#     arrow = arrow(type = "closed", length = unit(0.2, "inches")),
#     color = "white",
#     linewidth = 0.8
#   ) +
#   labs(
#     title = "AdaGrad Optimizer on f(x, y) = x^2 + y^2",
#     subtitle = "Step: {closest_state}",
#     x = "x",
#     y = "y"
#   ) +
#   coord_cartesian(xlim = c(-2, 5), ylim = c(-2, 5)) +  # Set axis limits
#   theme_minimal() +
#   theme(
#     plot.title = element_text(size = 22, face = "bold"),
#     plot.subtitle = element_text(size = 24, color = 'darkblue'),
#     axis.title = element_text(size = 18),
#     legend.title = element_text(size = 18),
#     legend.text = element_text(size = 14),
#     axis.text = element_text(size = 14)
#   ) +
#   transition_states(path_adagrad$step, transition_length = 2, state_length = 1) +
#   shadow_mark(size = 0.5, color = "gray")
# 
# # Save as a GIF
# anim_save("presentation_files/adagrad_optimizer_arrows.gif", animation = animate(plot_adagrad, fps = 5, width = 800, height = 600))
# 
# cat("GIF saved as 'adagrad_optimizer_arrows.gif'\n")
```

```{r show-gif-4, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("presentation_files/adagrad_optimizer_arrows.gif")
```

- **Explanation**: We start at the initial point (4, 4) on the quadratic surface $f(x, y) = x^2 + y^2$. The AdaGrad optimizer adapts the learning rate for each parameter based on the accumulated squared gradients: parameters with large accumulated gradients receive smaller updates, while the rest receive larger ones. Arrows represent the direction and magnitude of updates at each step.
The trajectory converges toward the global minimum at (0, 0), demonstrating how AdaGrad adjusts its learning rate dynamically to achieve convergence in a smooth, convex optimization landscape.

### **RMSProp Optimizer**

- **Mathematical Intuition:** RMSProp is an adaptive learning rate method that divides the learning rate by a moving average of the squared gradients. It works similarly to AdaGrad but adds a smoothing term that prevents the learning rate from decaying too quickly.
- **Update Rule:**
  - Compute the gradient $g_t$: $$ g_t = \nabla f(\theta_t) $$
  - Update the second moment $v_t$ with exponential decay: $$ v_t = \beta v_{t-1} + (1 - \beta) g_t^2 $$
  - Update the parameter $\theta_{t+1}$ using the RMSProp rule: $$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot g_t $$ where:
    - $v_t$ is the moving average of squared gradients,
    - $\beta$ is the smoothing constant, typically set to 0.9,
    - $\epsilon$ is a small constant for numerical stability.
- RMSProp adapts the learning rate based on recent gradient information, which keeps the effective learning rate from shrinking too quickly.
- **R Implementation:**

```{r}
# library(viridis)
# library(gganimate)
# 
# # RMSProp Optimization on Quadratic Function
# 
# # RMSProp Optimizer Function
# rmsprop_optimizer <- function(x_init, y_init, learning_rate, beta, epsilon, n_iter) {
#   x <- x_init
#   y <- y_init
#   v_x <- 0  # Initialize moving average for x
#   v_y <- 0  # Initialize moving average for y
#   path <- data.frame(step = 0, x = x, y = y, xend = NA, yend = NA)
# 
#   for (i in 1:n_iter) {
#     grad_x <- 2 * x  # Gradient with respect to x
#     grad_y <- 2 * y  # Gradient with respect to y
# 
#     # Update moving averages of squared gradients
#     v_x <- beta * v_x + (1 - beta) * grad_x^2
#     v_y <- beta * v_y + (1 - beta) * grad_y^2
# 
#     # Update parameters
#     dx <- -(learning_rate / (sqrt(v_x) + epsilon)) * grad_x
#     dy <- -(learning_rate / (sqrt(v_y) + epsilon)) * grad_y
#     x_new <- x + dx
#     y_new <- y + dy
# 
#     path <- rbind(path, data.frame(step = i, x = x, y = y, xend = x_new, yend = y_new))
#     x <- x_new
#     y <- y_new
#   }
# 
#   return(path)
# }
# 
# # Initial parameters
# x_init <- 4
# y_init <- 4
# learning_rate <- 0.1
# beta <- 0.9
# epsilon <- 1e-8
# n_iter <- 50
# 
# # Generate path
# path_rmsprop <- rmsprop_optimizer(x_init, y_init, learning_rate, beta, epsilon, n_iter)
# 
# # Remove rows with missing xend or yend values
# path_rmsprop <- na.omit(path_rmsprop)
# 
# # Create the quadratic function surface
# x_vals <- seq(-5, 5, length.out = 100)
# y_vals <- seq(-5, 5, length.out = 100)
# grid <- expand.grid(x = x_vals, y = y_vals)
# grid$z <- grid$x^2 + grid$y^2
# 
# # Plotting the RMSProp optimizer trajectory with arrows
# plot_rmsprop <- ggplot() +
#   geom_raster(data = grid, aes(x = x, y = y, fill = z), alpha = 0.8) +
#   scale_fill_viridis(name = "f(x, y)", option = "viridis") +
#   geom_segment(
#     data = path_rmsprop,
#     aes(x = x, y = y, xend = xend, yend = yend, group = 1),
#     arrow = arrow(type = "closed", length = unit(0.2, "inches")),
#     color = "white",
#     linewidth = 0.8
#   ) +
#   labs(
#     title = "RMSProp Optimizer on f(x, y) = x^2 + y^2",
#     subtitle = "Step: {closest_state}",
#     x = "x",
#     y = "y"
#   ) +
#   coord_cartesian(xlim = c(-2, 5), ylim = c(-2, 5)) +  # Set axis limits
#   theme_minimal() +
#   theme(
#     plot.title = element_text(size = 22, face = "bold"),
#     plot.subtitle = element_text(size = 24, color = 'darkblue'),
#     axis.title = element_text(size = 18),
#     legend.title = element_text(size = 18),
#     legend.text = element_text(size = 14),
#     axis.text = element_text(size = 14)
#   ) +
#   transition_states(path_rmsprop$step, transition_length = 2, state_length = 1) +
#   shadow_mark(size = 0.5, color = "gray")
# 
# # Save as a GIF
# anim_save("presentation_files/rmsprop_optimizer_arrows.gif", animation = animate(plot_rmsprop, fps = 5, width = 800, height = 600))
# 
# cat("GIF saved as 'rmsprop_optimizer_arrows.gif'\n")
```

```{r show-gif-5, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("presentation_files/rmsprop_optimizer_arrows.gif")
```

- **Explanation**:
  - We start at the initial point (4, 4) on the quadratic surface f(x, y) = x^2 + y^2.
  - The RMSProp optimizer updates the learning rate dynamically using a moving average of the squared gradients. This ensures:
    - Gradients that occur frequently do not overly diminish the learning rate.
    - Gradients that occur rarely still allow meaningful updates.
  - The momentum-like smoothing term ($\beta$) helps prevent the learning rate from decaying too quickly, unlike AdaGrad.
  - Arrows represent the direction and magnitude of updates, showcasing the efficiency of RMSProp in converging toward the global minimum at (0, 0). This visualization highlights the balance RMSProp strikes between learning rate adaptability and update stability.

### **Adam Optimizer**

- **Mathematical Intuition:** Adam (Adaptive Moment Estimation) combines the ideas of both **Momentum** and **RMSProp**. It tracks both the first moment (mean) and second moment (uncentered variance) of the gradients to adapt the learning rate for each parameter. It also uses bias correction to account for the zero initialization of the first and second moment estimates.
- **Update Rule:**
  1. **First Moment Estimate (Momentum):** $$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$ where $m_t$ is the first moment estimate (momentum) and $g_t$ is the gradient at time $t$.
  2. **Second Moment Estimate (RMSProp-like):** $$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$ where $v_t$ is the second moment estimate (similar to RMSProp).
  3. **Bias-Corrected First Moment Estimate:** $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$ This corrects the bias introduced by the initialization of $m_t$.
  4. **Bias-Corrected Second Moment Estimate:** $$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$ This corrects the bias introduced by the initialization of $v_t$.
  5. **Parameter Update:** $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
     Where:
     - $\eta$ is the learning rate,
     - $\epsilon$ is a small constant for numerical stability,
     - $\hat{m}_t$ is the bias-corrected first moment estimate,
     - $\hat{v}_t$ is the bias-corrected second moment estimate.
- Adam is highly effective and widely used because it adapts the learning rate (based on the gradient's first and second moments) and improves convergence across a wide variety of tasks.
- **R Implementation:**

```{r}
# library(viridis)
# library(gganimate)
# 
# # Adam Optimizer Function
# adam_optimizer <- function(x_init, y_init, learning_rate, beta1, beta2, epsilon, n_iter) {
#   x <- x_init
#   y <- y_init
#   m_x <- 0  # Initialize first moment estimate for x
#   m_y <- 0  # Initialize first moment estimate for y
#   v_x <- 0  # Initialize second moment estimate for x
#   v_y <- 0  # Initialize second moment estimate for y
#   path <- data.frame(step = 0, x = x, y = y, xend = NA, yend = NA)
# 
#   for (i in 1:n_iter) {
#     grad_x <- 2 * x  # Gradient with respect to x
#     grad_y <- 2 * y  # Gradient with respect to y
# 
#     # Update first moment estimate (mean of gradients)
#     m_x <- beta1 * m_x + (1 - beta1) * grad_x
#     m_y <- beta1 * m_y + (1 - beta1) * grad_y
# 
#     # Update second moment estimate (variance of gradients)
#     v_x <- beta2 * v_x + (1 - beta2) * grad_x^2
#     v_y <- beta2 * v_y + (1 - beta2) * grad_y^2
# 
#     # Bias correction for moment estimates
#     m_x_hat <- m_x / (1 - beta1^i)
#     m_y_hat <- m_y / (1 - beta1^i)
#     v_x_hat <- v_x / (1 - beta2^i)
#     v_y_hat <- v_y / (1 - beta2^i)
# 
#     # Update parameters
#     dx <- -(learning_rate / (sqrt(v_x_hat) + epsilon)) * m_x_hat
#     dy <- -(learning_rate / (sqrt(v_y_hat) + epsilon)) * m_y_hat
#     x_new <- x + dx
#     y_new <- y + dy
# 
#     path <- rbind(path, data.frame(step = i, x = x, y = y, xend = x_new, yend = y_new))
#     x <- x_new
#     y <- y_new
#   }
# 
#   return(path)
# }
# 
# # Initial parameters
# x_init <- 4
# y_init <- 4
# learning_rate <- 0.1
# beta1 <- 0.9
# beta2 <- 0.999
# epsilon <- 1e-8
# n_iter <- 50
# 
# # Generate path
# path_adam <- adam_optimizer(x_init, y_init, learning_rate, beta1, beta2, epsilon, n_iter)
# 
# # Remove rows with missing xend or yend values
# path_adam <- na.omit(path_adam)
# 
# # Create the quadratic function surface
# x_vals <- seq(-5, 5, length.out = 100)
# y_vals <- seq(-5, 5, length.out = 100)
# grid <- expand.grid(x = x_vals, y = y_vals)
# grid$z <- grid$x^2 + grid$y^2
# 
# # Plotting the Adam optimizer trajectory with arrows
# plot_adam <- ggplot() +
#   geom_raster(data = grid, aes(x = x, y = y, fill = z), alpha = 0.8) +
#   scale_fill_viridis(name = "f(x, y)", option = "viridis") +
#   geom_segment(
#     data = path_adam,
#     aes(x = x, y = y, xend = xend, yend = yend, group = 1),
#     arrow = arrow(type = "closed", length = unit(0.2, "inches")),
#     color = "white",
#     linewidth = 0.8
#   ) +
#   labs(
#     title = "Adam Optimizer on f(x, y) = x^2 + y^2",
#     subtitle = "Step: {closest_state}",
#     x = "x",
#     y = "y"
#   ) +
#   coord_cartesian(xlim = c(-2, 5), ylim = c(-2, 5)) +  # Set axis limits
#   theme_minimal() +
#   theme(
#     plot.title = element_text(size = 22, face = "bold"),
#     plot.subtitle = element_text(size = 24, color = 'darkblue'),
#     axis.title = element_text(size = 18),
#     legend.title = element_text(size = 18),
#     legend.text = element_text(size = 14),
#     axis.text = element_text(size = 14)
#   ) +
#   transition_states(path_adam$step, transition_length = 2, state_length = 1) +
#   shadow_mark(size = 0.5, color = "gray")
# 
# # Save as a GIF
# anim_save("presentation_files/adam_optimizer_arrows.gif", animation = animate(plot_adam, fps = 5, width = 800, height = 600))
# 
# cat("GIF saved as 'adam_optimizer_arrows.gif'\n")
```

```{r show-gif-6, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("presentation_files/adam_optimizer_arrows.gif")
```

- **Explanation**:
  - The optimizer starts at the initial point (4, 4) on the quadratic surface f(x, y) = x^2 + y^2.
  - **Adam** combines the benefits of:
    - **Momentum**: Using the first moment estimate ($m_t$) to smooth updates and reduce oscillations.
    - **RMSProp**: Using the second moment estimate ($v_t$) to adapt the learning rate dynamically for each parameter.
  - Bias correction ensures that the first few updates are unbiased, despite starting with zero moments.
  - Arrows represent the direction and magnitude of updates at each step, showing the optimizer's ability to adapt learning rates and converge efficiently to the global minimum at (0, 0).

### **Visualizing the Behavior of Different Optimizers**

```{r}
# library(dplyr)      # %>%, mutate, bind_rows
# library(ggplot2)
# library(ggrepel)    # geom_text_repel
# library(gganimate)
# 
# # Define the quadratic function
# quadratic_function <- function(x, y) {
#   return(x^2 + y^2)
# }
# 
# # Generate the 3D surface as a data frame
# generate_surface <- function() {
#   x <- seq(-5, 5, length.out = 100)
#   y <- seq(-5, 5, length.out = 100)
#   expand.grid(x = x, y = y) %>%
#     mutate(z = quadratic_function(x, y))
# }
# 
# surface <- generate_surface()
# 
# # Gradient Descent
# gradient_descent <- function(x_init, y_init, learning_rate, n_iter) {
#   x <- x_init
#   y <- y_init
#   path <- data.frame(step = 0, x = x, y = y, z = quadratic_function(x, y), method = "GD")
# 
#   for (i in 1:n_iter) {
#     grad <- c(2 * x, 2 * y)
#     x <- x - learning_rate * grad[1]
#     y <- y - learning_rate * grad[2]
#     path <- rbind(path, data.frame(step = i, x = x, y = y, z = quadratic_function(x, y), method = "GD"))
#   }
#   return(path)
# }
# 
# # Stochastic Gradient Descent
# stochastic_gradient_descent <- function(x_init, y_init, learning_rate, n_iter) {
#   x <- x_init
#   y <- y_init
#   path <- data.frame(step = 0, x = x, y = y, z = quadratic_function(x, y), method = "SGD")
# 
#   for (i in 1:n_iter) {
#     grad <- c(2 * x, 2 * y) + runif(2, -0.1, 0.1)  # noisy gradient
#     x <- x - learning_rate * grad[1]
#     y <- y - learning_rate * grad[2]
#     path <- rbind(path, data.frame(step = i, x = x, y = y, z = quadratic_function(x, y), method = "SGD"))
#   }
#   return(path)
# }
# 
# # Momentum
# momentum_optimizer <- function(x_init, y_init, learning_rate, beta, n_iter) {
#   x <- x_init
#   y <- y_init
#   v_x <- 0
#   v_y <- 0
#   path <- data.frame(step = 0, x = x, y = y, z = quadratic_function(x, y), method = "Momentum")
# 
#   for (i in 1:n_iter) {
#     grad <- c(2 * x, 2 * y)
#     v_x <- beta * v_x + (1 - beta) * grad[1]
#     v_y <- beta * v_y + (1 - beta) * grad[2]
#     x <- x - learning_rate * v_x
#     y <- y - learning_rate * v_y
#     path <- rbind(path, data.frame(step = i, x = x, y = y, z = quadratic_function(x, y), method = "Momentum"))
#   }
#   return(path)
# }
# 
# # AdaGrad
# adagrad_optimizer <- function(x_init, y_init, learning_rate, epsilon, n_iter) {
#   x <- x_init
#   y <- y_init
#   G_x <- 0
#   G_y <- 0
#   path <- data.frame(step = 0, x = x, y = y, z = quadratic_function(x, y), method = "AdaGrad")
# 
#   for (i in 1:n_iter) {
#     grad <- c(2 * x, 2 * y)
#     G_x <- G_x + grad[1]^2
#     G_y <- G_y + grad[2]^2
#     x <- x - (learning_rate / (sqrt(G_x) + epsilon)) * grad[1]
#     y <- y - (learning_rate / (sqrt(G_y) + epsilon)) * grad[2]
#     path <- rbind(path, data.frame(step = i, x = x, y = y, z = quadratic_function(x, y), method = "AdaGrad"))
#   }
#   return(path)
# }
# 
# # RMSProp
# rmsprop_optimizer <- function(x_init, y_init, learning_rate, beta, epsilon, n_iter) {
#   x <- x_init
#   y <- y_init
#   v_x <- 0
#   v_y <- 0
#   path <- data.frame(step = 0, x = x, y = y, z = quadratic_function(x, y), method = "RMSProp")
# 
#   for (i in 1:n_iter) {
#     grad <- c(2 * x, 2 * y)
#     v_x <- beta * v_x + (1 - beta) * grad[1]^2
#     v_y <- beta * v_y + (1 - beta) * grad[2]^2
#     x <- x - (learning_rate / (sqrt(v_x) + epsilon)) * grad[1]
#     y <- y - (learning_rate / (sqrt(v_y) + epsilon)) * grad[2]
#     path <- rbind(path, data.frame(step = i, x = x, y = y, z = quadratic_function(x, y), method = "RMSProp"))
#   }
#   return(path)
# }
# 
# # Adam
# adam_optimizer <- function(x_init, y_init, learning_rate, beta1, beta2, epsilon, n_iter) {
#   x <- x_init
#   y <- y_init
#   m_x <- 0
#   m_y <- 0
#   v_x <- 0
#   v_y <- 0
#   path <- data.frame(step = 0, x = x, y = y, z = quadratic_function(x, y), method = "Adam")
# 
#   for (i in 1:n_iter) {
#     grad <- c(2 * x, 2 * y)
#     m_x <- beta1 * m_x + (1 - beta1) * grad[1]
#     m_y <- beta1 * m_y + (1 - beta1) * grad[2]
#     v_x <- beta2 * v_x + (1 - beta2) * grad[1]^2
#     v_y <- beta2 * v_y + (1 - beta2) * grad[2]^2
#     m_x_hat <- m_x / (1 - beta1^i)
#     m_y_hat <- m_y / (1 - beta1^i)
#     v_x_hat <- v_x / (1 - beta2^i)
#     v_y_hat <- v_y / (1 - beta2^i)
#     x <- x - (learning_rate / (sqrt(v_x_hat) + epsilon)) * m_x_hat
#     y <- y - (learning_rate / (sqrt(v_y_hat) + epsilon)) * m_y_hat
#     path <- rbind(path, data.frame(step = i, x = x, y = y, z = quadratic_function(x, y), method = "Adam"))
#   }
#   return(path)
# }
# 
# # Generate optimizer trajectories
# n_iter <- 50
# learning_rate <- 0.1
# trajectories <- bind_rows(
#   gradient_descent(4, 4, learning_rate, n_iter),
#   stochastic_gradient_descent(4, 4, learning_rate, n_iter),
#   momentum_optimizer(4, 4, learning_rate, beta = 0.9, n_iter),
#   adagrad_optimizer(4, 4, learning_rate, epsilon = 1e-8, n_iter),
#   rmsprop_optimizer(4, 4, learning_rate, beta = 0.9, epsilon = 1e-8, n_iter),
#   adam_optimizer(4, 4, learning_rate, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8, n_iter)
# )
# 
# # Update labels to include method abbreviation
# trajectories <- trajectories %>%
#   mutate(label = paste(method, step))
# 
# # Assign colors and shapes to optimizers
# shapes <- c("GD" = 15, "SGD" = 17, "Momentum" = 18, "AdaGrad" = 19, "RMSProp" = 24, "Adam" = 25)
# colors <- viridis::viridis(6)  # Define distinct colors for optimizers
# 
# # Create animated plot
# animated_plot <- ggplot() +
#   # 3D surface
#   geom_tile(data = surface, aes(x = x, y = y, fill = z), alpha = 0.8) +
#   scale_fill_viridis_c(name = "f(X, Y)") +  # Continuous scale for the surface
#   # Optimizer trajectories
#   geom_point(data = trajectories, aes(x = x, y = y, color = method, shape = method), size = 7, stroke = 0.5) +
#   geom_text_repel(data = trajectories, aes(x = x, y = y, label = label, color = method),
#                   size = 6, box.padding = 0.5, point.padding = 0.5, segment.color = "grey50") +
#   scale_color_manual(values = colors, name = "Optimizer") +  # Discrete colors for optimizers
#   scale_shape_manual(values = shapes) +  # Discrete shapes for optimizers
#   # Labels and theme
#   labs(
#     title = "Optimization Trajectories to the Global Minimum",
#     subtitle = "Deep learning optimizer dynamics",
#     x = "X",
#     y = "Y"
#   ) +
#   coord_cartesian(xlim = c(-2, 5), ylim = c(-2, 5)) +  # Set axis limits
#   theme_minimal() +
#   theme(
#     plot.title = element_text(size = 22, face = "bold"),
#     plot.subtitle = element_text(size = 22),
#     legend.title = element_text(size = 18),
#     legend.text = element_text(size = 14),
#     axis.title = element_text(size = 18),
#     axis.text = element_text(size = 14)
#   ) +
#   transition_states(step, wrap = FALSE)
# 
# # Save the animation as a GIF
# animate(animated_plot, fps = 5, width = 800, height = 600, renderer = gifski_renderer("presentation_files/optimization_3d_fixed.gif"))
```

```{r show-gif-7, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("presentation_files/optimization_3d_fixed.gif")
```

### **Efficiency Comparison**

1. **Speed**:
   - **Fastest**: Adam, RMSProp, and SGD (for large datasets).
   - **Slowest**: Gradient Descent, because it computes the gradient over the entire dataset at each step.
2. **Convergence Stability**:
   - **Stable**: Momentum, Adam, and RMSProp are robust to oscillations.
   - **Noisy**: SGD can oscillate or overshoot due to randomness.
3. **Adaptability**:
   - **Most Adaptive**: Adam (adjusts the learning rate dynamically for each parameter).
   - **Least Adaptive**: Gradient Descent (fixed learning rate for all parameters).
4. **Handling Sparse Data**:
   - **Best**: AdaGrad and Adam are designed for sparse gradients.
   - **Worst**: Gradient Descent and Momentum may struggle without careful tuning.

### **Conclusion**

- **Adam** is widely regarded as the most efficient and versatile optimizer for general use cases due to its adaptive nature, fast convergence, and stability.
However, for very specific tasks, other methods such as RMSProp or AdaGrad may perform better.

### **2D Static Images**

```{r}
# # Function to plot optimizer trajectory with arrows and step numbers
# plot_optimizer_with_arrows <- function(trajectory, title) {
#   ggplot() +
#     # Surface plot
#     geom_tile(data = surface, aes(x = x, y = y, fill = z), alpha = 0.8) +
#     scale_fill_viridis_c(name = "f(X, Y)") +
#     # Trajectory arrows: black border layer beneath a white inner layer
#     geom_segment(data = trajectory, aes(x = x, y = y, xend = lead(x), yend = lead(y)),
#                  arrow = arrow(type = "closed", length = unit(0.1, "inches")),
#                  color = "black", size = 0.8) +  # Black border for arrows
#     geom_segment(data = trajectory, aes(x = x, y = y, xend = lead(x), yend = lead(y)),
#                  arrow = arrow(type = "closed", length = unit(0.1, "inches")),
#                  color = "white", size = 0.5) +  # White inner arrow
#     # Step labels
#     geom_text_repel(data = trajectory, aes(x = x, y = y, label = step), size = 3, color = "white", vjust = -2) +
#     labs(
#       title = title,
#       x = "X",
#       y = "Y"
#     ) +
#     coord_cartesian(xlim = c(-2, 5), ylim = c(-2, 5)) +  # Set axis limits
#     theme_bw() +
#     theme(
#       plot.title = element_text(size = 22, face = "bold"),
#       axis.title = element_text(size = 16),
#       axis.text = element_text(size = 14),
#       legend.text = element_text(size = 14),
#       legend.title = element_text(size = 14)
#     )
# }
# 
# # Generate trajectories for each optimizer
# gradient_trajectory <- gradient_descent(4, 4, learning_rate = 0.1, n_iter = 50)
# sgd_trajectory <- stochastic_gradient_descent(4, 4, learning_rate = 0.1, n_iter = 50)
# momentum_trajectory <- momentum_optimizer(4, 4, learning_rate = 0.1, beta = 0.9, n_iter = 50)
# adagrad_trajectory <- adagrad_optimizer(4, 4, learning_rate = 0.1, epsilon = 1e-8, n_iter = 50)
# rmsprop_trajectory <- rmsprop_optimizer(4, 4, learning_rate = 0.1, beta = 0.9, epsilon = 1e-8, n_iter = 50)
# adam_trajectory <- adam_optimizer(4, 4, learning_rate = 0.1, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8, n_iter = 50)
# 
# # Create and save plots for each optimizer
# ggsave("presentation_files/gradient_descent_arrows.png", plot_optimizer_with_arrows(gradient_trajectory, "Gradient Descent Trajectory"), width = 8, height = 6)
# ggsave("presentation_files/sgd_arrows.png", plot_optimizer_with_arrows(sgd_trajectory, "SGD Trajectory"), width = 8, height = 6)
# ggsave("presentation_files/momentum_arrows.png", plot_optimizer_with_arrows(momentum_trajectory, "Momentum Trajectory"), width = 8, height = 6)
# ggsave("presentation_files/adagrad_arrows.png", plot_optimizer_with_arrows(adagrad_trajectory, "AdaGrad Trajectory"), width = 8, height = 6)
# ggsave("presentation_files/rmsprop_arrows.png", plot_optimizer_with_arrows(rmsprop_trajectory, "RMSProp Trajectory"), width = 8, height = 6)
# ggsave("presentation_files/adam_arrows.png", plot_optimizer_with_arrows(adam_trajectory, "Adam Trajectory"), width = 8, height = 6)
```

```{r show-gif-8, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics("presentation_files/gradient_descent_arrows.png")
knitr::include_graphics("presentation_files/sgd_arrows.png")
knitr::include_graphics("presentation_files/momentum_arrows.png")
knitr::include_graphics("presentation_files/adagrad_arrows.png")
knitr::include_graphics("presentation_files/rmsprop_arrows.png")
knitr::include_graphics("presentation_files/adam_arrows.png")
```

### Comparison Between Optimizers

| **Optimizer** | **Description** | **Pros** | **Cons** | **Best Use** |
|---|---|---|---|---|
| **Gradient Descent** | Full gradient each update | Stable convergence | Slow, costly | Small, static datasets |
| **SGD** | One sample/batch per update | Fast, escapes minima | High variance, less stable | Large datasets, limited resources |
| **Momentum** | Adds momentum to smooth updates | Faster, reduces oscillation | Adds tuning complexity | Oscillatory paths |
| **AdaGrad** | Adapts learning rate per parameter | Good for sparse data | Learning rate decays | Sparse datasets |
| **RMSProp** | Moving avg. squared gradients | Solves AdaGrad decay issue | Needs decay rate tuning | Non-stationary tasks (e.g., RNNs) |
| **Adam** | Combines Momentum + RMSProp | Fast, robust to noise | Complex, more tuning needed | General-purpose neural networks |
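
The update rules compared above can also be checked numerically, without any plotting. The sketch below is a minimal, self-contained re-implementation (the helper name `run_optimizer` is our own, not part of the chunks above): it runs each rule on $f(x, y) = x^2 + y^2$ from (4, 4) with the same hyperparameters used earlier and reports the final Euclidean distance to the minimum at (0, 0).

```r
# Minimal side-by-side convergence check for the six update rules on
# f(x, y) = x^2 + y^2, whose gradient is c(2 * x, 2 * y).
run_optimizer <- function(method, n_iter = 50, lr = 0.1) {
  theta <- c(4, 4)                     # start at (4, 4), as in the sections above
  v <- c(0, 0); m <- c(0, 0); G <- c(0, 0)
  beta <- 0.9; beta1 <- 0.9; beta2 <- 0.999; eps <- 1e-8
  for (i in seq_len(n_iter)) {
    g <- 2 * theta                     # exact gradient
    if (method == "SGD") g <- g + runif(2, -0.1, 0.1)  # noisy gradient estimate
    if (method %in% c("GD", "SGD")) {
      theta <- theta - lr * g
    } else if (method == "Momentum") {
      v <- beta * v + (1 - beta) * g
      theta <- theta - lr * v
    } else if (method == "AdaGrad") {
      G <- G + g^2                     # running sum of squared gradients
      theta <- theta - (lr / (sqrt(G) + eps)) * g
    } else if (method == "RMSProp") {
      v <- beta * v + (1 - beta) * g^2 # moving average of squared gradients
      theta <- theta - (lr / (sqrt(v) + eps)) * g
    } else if (method == "Adam") {
      m <- beta1 * m + (1 - beta1) * g
      v <- beta2 * v + (1 - beta2) * g^2
      m_hat <- m / (1 - beta1^i)       # bias-corrected moments
      v_hat <- v / (1 - beta2^i)
      theta <- theta - (lr / (sqrt(v_hat) + eps)) * m_hat
    }
  }
  sqrt(sum(theta^2))                   # distance to the minimum at (0, 0)
}

set.seed(42)
methods <- c("GD", "SGD", "Momentum", "AdaGrad", "RMSProp", "Adam")
print(round(sapply(methods, run_optimizer), 4))
```

On this quadratic, GD and the momentum-based methods end up very close to the origin, while AdaGrad typically retains the largest distance at this learning rate, illustrating the learning-rate decay issue noted in the table.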